
Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Neural Information Processing Systems

We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a pipeline similar to that of Deep Voice 1, but constructed with higher-performance building blocks, and demonstrate a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, again demonstrating a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high-quality audio synthesis and preserving speaker identities almost perfectly.
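The core idea above, conditioning a single TTS model on a low-dimensional trainable embedding per speaker, can be sketched as follows. This is a minimal illustrative toy, not the authors' architecture: the shapes, the random projection standing in for the synthesis network, and the concatenation site are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiSpeakerTTS:
    """Toy sketch of multi-speaker conditioning: a trainable per-speaker
    embedding table is looked up by speaker ID and concatenated onto the
    text features at every time step (illustrative shapes only)."""

    def __init__(self, num_speakers, embed_dim=16, text_dim=64, mel_dim=80):
        # One low-dimensional vector per speaker; in the real system these
        # are learned jointly with the rest of the model.
        self.speaker_embedding = rng.normal(size=(num_speakers, embed_dim)) * 0.01
        # Stand-in for the synthesis network: a single linear projection.
        self.proj = rng.normal(size=(text_dim + embed_dim, mel_dim)) * 0.01

    def forward(self, text_features, speaker_id):
        # text_features: (time, text_dim)
        T = text_features.shape[0]
        # Broadcast the chosen speaker's embedding across all time steps.
        spk = np.tile(self.speaker_embedding[speaker_id], (T, 1))
        conditioned = np.concatenate([text_features, spk], axis=-1)
        return conditioned @ self.proj  # (time, mel_dim) mel-like output

model = MultiSpeakerTTS(num_speakers=108)
mel = model.forward(rng.normal(size=(50, 64)), speaker_id=7)
print(mel.shape)  # (50, 80)
```

Because the embedding table is tiny relative to the shared network, adding a new voice only requires learning one short vector, which is consistent with the abstract's claim of learning hundreds of voices from under half an hour of data each.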


Reviews: Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Neural Information Processing Systems

This paper presents a solid piece of work on speaker-dependent neural TTS, building on the earlier Deep Voice and Tacotron architectures. The key idea is to learn a speaker-dependent embedding vector jointly with the neural TTS model. The paper is clearly written, and the experiments are presented well. My comments are as follows. ASR researchers later found that fixed speaker embeddings such as i-vectors can work equally well (or even better).



Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Gibiansky, Andrew, Arik, Sercan, Diamos, Gregory, Miller, John, Peng, Kainan, Ping, Wei, Raiman, Jonathan, Zhou, Yanqi

Neural Information Processing Systems

We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a pipeline similar to that of Deep Voice 1, but constructed with higher-performance building blocks, and demonstrate a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, again demonstrating a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high-quality audio synthesis and preserving speaker identities almost perfectly.


Audio Book Excerpt: Timing, Extracts B & C (Richard Abbott)

#artificialintelligence

Today I'm pleased to present to readers the next entry in our series featuring author Richard Abbott, whose space jaunts have so delighted me and many others. I'd previously reviewed Abbott's debut sci-fi novel, Far from the Spaceports, and followed up with a review of its sequel, Timing. The audio excerpts below come from the second novel and, like our previous entry, use Amazon's Polly software, a text-to-speech service that offers multiple accents and intonations; Alexa, by contrast, uses a single voice. Before moving forward, for those unfamiliar with the novels and their plots, I've linked the book covers to their respective Amazon blurbs.


Baidu's new text-to-speech system can master hundreds of accents

#artificialintelligence

There is a renaissance happening in the world of artificial intelligence. Using deep learning, researchers are producing systems that can recognize objects, understand spoken language, and even simulate the human voice. The quality of these systems is advancing at a blistering pace. Just three months ago, Chinese search giant Baidu showed off Deep Voice, a system for turning text into speech. It could produce speech that was nearly indistinguishable from an actual human voice on a first listen, and do it in near real time.



Baidu's text-to-speech system mimics a variety of accents 'perfectly'

Engadget

Chinese tech giant Baidu's text-to-speech system, Deep Voice, is making a lot of progress toward sounding more human. The latest news about the tech is a set of audio samples showcasing its ability to accurately portray differences in regional accents. The company says that the new version, aptly named Deep Voice 2, has been able to "learn from hundreds of unique voices from less than a half an hour of data per speaker, while achieving high audio quality." That's compared to the 20 hours of training it took to get similar results for a single voice from the previous iteration, further pushing its efficiency past Google's WaveNet in a few months' time. Baidu says that unlike previous text-to-speech systems, Deep Voice 2 finds shared qualities between the training voices entirely on its own, without any prior guidance.


[R] Deep Voice 2: Multi-Speaker Neural Text-to-Speech • r/MachineLearning

#artificialintelligence

TL;DR Baidu's TTS system now supports multi-speaker conditioning, and can learn new speakers with very little data (a la LyreBird). I'm really excited about the recent influx of neural-net TTS systems, but all of them seem to be too slow for real-time dialog, or not publicly available, or both. Hoping that one of them gets a high-quality open-source implementation soon!